Abstract 

Distance weighted discrimination (DWD) was originally proposed to handle the 
data piling issue in the support vector machine. In this paper, we consider the sparse 
penalized DWD for high-dimensional classification. The state-of-the-art algorithm for 
solving the standard DWD is based on second-order cone programming; however, such 
an algorithm does not work well for the sparse penalized DWD with high-dimensional 
data. To overcome this computational difficulty, we develop a very 
efficient algorithm to compute the solution path of the sparse DWD at a given fine 
grid of regularization parameters. We implement the algorithm in a publicly available 
R package sdwd. We conduct extensive numerical experiments to demonstrate the 
computational efficiency and classification performance of our method. 
1 Introduction 


The support vector machine (SVM) (Vapnik, 1995) is a widely used modern classification 
method. In the standard binary classification problem, the training dataset consists of $n$ pairs, 
$\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. The linear SVM seeks a hyperplane 
$\{x : \beta_0 + x^T\beta = 0\}$ which maximizes the smallest margin of all data points: 

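$$
\begin{aligned}
&\arg\max_{\beta_0, \beta} \ \min_i \ d_i, \\
&\text{subject to } d_i = y_i(\beta_0 + x_i^T\beta) + \eta_i \ge 0, \ \forall i, \\
&\eta_i \ge 0, \ \forall i, \quad \sum_{i=1}^n \eta_i \le c, \quad \|\beta\|_2^2 = 1,
\end{aligned}
\tag{1.1}
$$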

where $d_i$ is defined as the margin of the $i$th data point, the $\eta_i$'s are slack variables introduced to 
ensure that all margins are non-negative, and $c > 0$ is a tuning parameter controlling the overlap. By 
using the kernel trick, the SVM can also produce nonlinear decision boundaries by fitting an 
optimal separating hyperplane in the extended kernel feature space. Readers are referred 
to Hastie et al. (2009) for a more detailed explanation of the SVM. 

Marron et al. (2007) noticed that when the SVM is applied to data with $n < p$, 
many data points lie on two hyperplanes parallel to the decision boundary. Marron et al. 
(2007) referred to this phenomenon as data piling and claimed that data piling can 
“affect the generalization performance of SVM”. To overcome this issue, Marron et al. 
(2007) proposed a new method called distance weighted discrimination (DWD), which 
finds a separating hyperplane minimizing the sum of the inverse margins of all data points: 

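$$
\begin{aligned}
&\arg\min_{\beta_0, \beta} \ \sum_{i=1}^n 1/d_i, \\
&\text{subject to } d_i = y_i(\beta_0 + x_i^T\beta) + \eta_i \ge 0, \ \forall i, \\
&\eta_i \ge 0, \ \forall i, \quad \sum_{i=1}^n \eta_i \le c, \quad \|\beta\|_2^2 = 1.
\end{aligned}
\tag{1.2}
$$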

The initial version of Marron et al. (2007) also mentioned that the sum of the inverse margins 
$\sum_i 1/d_i$ could be replaced by the $q$th power of the inverse margins, and this 
generalized version was used as the definition of the DWD in Hall et al. (2005). Marron et al. 
(2007) asserted that the DWD can avoid data piling and thereby improve generalizability. 
One example [see group 2 of Figure 3 in Marron et al. (2007)] shows that the DWD has 
about 5% prediction error whereas the SVM has 15%. The improvement of the DWD over the 
SVM is also illustrated in Hall et al. (2005) through a novel geometric view. As for the 
computation of the DWD, Marron et al. (2007) observed that the DWD is an application of 
second-order cone programming and thus can be solved by primal-dual interior-point 
methods. The algorithm has been implemented in both Matlab code 
(http://www.unc.edu/~marron/marron_software.html) and an R package DWD (Huang et al., 2012). 

In this paper we focus on classification with high-dimensional data where the number 
of covariates is much larger than the sample size. The standard SVM and DWD are not 
suitable tools for high-dimensional classification for two reasons. First, based on the scientific 
hypothesis that only a few important variables affect the outcome, a good classifier for 
high-dimensional classification should have the ability to select important variables and 
discard irrelevant ones. However, the standard SVM and DWD use all variables and do 
not conduct variable selection. Second, because these two classifiers use all variables, they 
may have very poor classification performance. As explained in Fan and Fan (2008), the bad 
performance is caused by the error accumulation when estimating too many noise variables 
in the classifier. Owing to these two considerations, sparse classifiers are generally preferred 
for high-dimensional classification. In the literature, some penalties have been applied to 
the SVM to produce sparse SVMs, such as the $\ell_1$ SVM (Bradley and Mangasarian, 1998; 
Zhu et al., 2004), the SCAD SVM (Zhang et al., 2006), and the elastic-net penalized SVM 
(Wang et al., 2006). 

In this work we consider the sparse penalized DWD for high-dimensional classification. The 
standard DWD uses the $\ell_2$ penalty and can be solved by second-order cone programming. 
However, the sparse DWD is computationally more challenging and requires a different 
computing algorithm. To cope with the computational challenges associated with the sparse 
penalty and high dimensionality, we derive an efficient algorithm to solve the sparse DWD 
by combining the majorization-minimization principle and coordinate descent. We have 
implemented the algorithm in an R package sdwd. To give a quick demonstration here, we use the 
prostate cancer data [Singh et al. (2002), 102 observations and 6033 genes] as an example. 
The left panel of Figure 1 depicts the solution paths of the elastic-net penalized DWD, and 
sdwd took only 0.453 seconds to compute the whole solution path. For comparison, we also 
used the code in Wang et al. (2006) to compute the solution path of the elastic-net penalized 
SVM. We observed that the computing time of the sparse SVM was about 290 times that 
of the sparse DWD. 
Figure 1. The solution paths for the prostate data ($n = 102$, $p = 6033$) using the elastic-net DWD 
and the elastic-net SVM. In every method, $\lambda_2$ is fixed to be 1. The dashed vertical lines indicate 
the $\lambda_1$ selected by five-fold cross-validation. Both timings are averaged over 10 runs. 


2 Sparse DWD 

In this section we present several sparse penalized DWDs. Our formulation follows the $\ell_1$ 
SVM (Zhu et al., 2004). Thus, we first review the derivation of the $\ell_1$ SVM. The 
standard SVM (1.1) is often rephrased as the following quadratic programming problem 
(Hastie et al., 2009): 

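$$
\begin{aligned}
&\arg\min_{\beta_0, \beta} \ \|\beta\|_2^2 \\
&\text{subject to } y_i(\beta_0 + x_i^T\beta) + \eta_i \ge 1, \ \forall i, \\
&\eta_i \ge 0, \ \forall i, \quad \sum_{i=1}^n \eta_i \le c.
\end{aligned}
$$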

Moreover, the above constrained minimization problem has an equivalent loss+penalty 
formulation (Hastie et al., 2009): 

$$
\arg\min_{\beta_0, \beta} \ \frac{1}{n}\sum_{i=1}^n \left[1 - y_i(\beta_0 + x_i^T\beta)\right]_+ + \frac{\lambda_2}{2}\|\beta\|_2^2.
$$

The loss function $[1 - t]_+ = \max(1 - t, 0)$ is the so-called hinge loss in the literature. In the 
high-dimensional setting, the standard SVM uses all variables because of the $\ell_2$ norm penalty 
used therein. As a result, its performance can be very poor. Zhu et al. (2004) proposed the 
$\ell_1$-norm SVM to fix this issue: 

$$
\arg\min_{\beta_0, \beta} \ \frac{1}{n}\sum_{i=1}^n \left[1 - y_i(\beta_0 + x_i^T\beta)\right]_+ + \lambda_1\|\beta\|_1.
$$

Similarly, we can propose the $\ell_1$ penalized DWD. It has been shown that the standard 
DWD also has a loss+penalty formulation (Liu et al., 2011): 

$$
\arg\min_{\beta_0, \beta} \ \frac{1}{n}\sum_{i=1}^n V\left(y_i(\beta_0 + x_i^T\beta)\right) + \frac{\lambda_2}{2}\|\beta\|_2^2,
$$


where the loss function is given by 

$$
V(u) = \begin{cases} 1 - u, & \text{if } u \le 1/2, \\ 1/(4u), & \text{if } u > 1/2. \end{cases}
$$
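As a minimal sketch of the loss just defined, the following Python functions evaluate $V(u)$ and its derivative $V'(u)$, with the hinge loss alongside for comparison. The function names are illustrative assumptions and are not taken from the sdwd package.

```python
def dwd_loss(u: float) -> float:
    """DWD loss V(u): linear for u <= 1/2, inverse for u > 1/2."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def dwd_loss_derivative(u: float) -> float:
    """First derivative V'(u); both pieces give -1 at u = 1/2."""
    return -1.0 if u <= 0.5 else -1.0 / (4.0 * u * u)


def hinge_loss(u: float) -> float:
    """SVM hinge loss [1 - u]_+, shown only for comparison."""
    return max(1.0 - u, 0.0)


# Unlike the hinge loss, the DWD loss stays positive for large margins.
for margin in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(margin, dwd_loss(margin), hinge_loss(margin))
```

The two pieces of $V$ meet at $u = 1/2$ with matching value and slope, which is what makes the loss differentiable everywhere.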


Similar to the $\ell_1$ SVM, we replace the $\ell_2$ norm penalty with the $\ell_1$ norm penalty in order to 
achieve sparsity in the DWD classifier. Hence, the $\ell_1$ DWD is defined by 

$$
(\hat\beta_0(\mathrm{lasso}), \hat\beta(\mathrm{lasso})) = \arg\min_{\beta_0, \beta} \ \frac{1}{n}\sum_{i=1}^n V\left(y_i(\beta_0 + x_i^T\beta)\right) + \lambda_1\|\beta\|_1. \tag{2.1}
$$

The lasso penalized DWD classification rule is $\mathrm{Sign}(\hat\beta_0(\mathrm{lasso}) + x^T\hat\beta(\mathrm{lasso}))$. 

Besides the $\ell_1$ norm penalty, we also consider the elastic-net penalty (Zou and Hastie, 
2005). It is now well known that the elastic-net often outperforms the lasso ($\ell_1$ norm penalty) 
in prediction. Wang et al. (2006) studied the elastic-net penalized SVM (DrSVM) and showed 
that the DrSVM performs better than the $\ell_1$ norm SVM. Similarly, we propose the elastic-net 
penalized DWD: 

$$
(\hat\beta_0(\mathrm{enet}), \hat\beta(\mathrm{enet})) = \arg\min_{\beta_0, \beta} \ \frac{1}{n}\sum_{i=1}^n V\left(y_i(\beta_0 + x_i^T\beta)\right) + P_{\lambda_1, \lambda_2}(\beta), \tag{2.2}
$$


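where 

$$
P_{\lambda_1, \lambda_2}(\beta) = \sum_{j=1}^p \left(\lambda_1|\beta_j| + \frac{\lambda_2}{2}\beta_j^2\right).
$$

The elastic-net penalized DWD classification rule is $\mathrm{Sign}(\hat\beta_0(\mathrm{enet}) + x^T\hat\beta(\mathrm{enet}))$. Both $\lambda_1$ 
and $\lambda_2$ are important tuning parameters for regularization. In practice, $\lambda_1$ and $\lambda_2$ are chosen 
from finite grids by validation or cross-validation. 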
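To make the objective in (2.2) concrete, here is a small sketch that evaluates the elastic-net penalized DWD criterion and applies the sign rule for a given coefficient vector. The helper names are hypothetical, and mapping a score of exactly zero to the positive class is an assumption made only for this illustration.

```python
from typing import List, Sequence


def dwd_loss(u: float) -> float:
    """DWD loss V(u)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def enet_penalty(beta: Sequence[float], lam1: float, lam2: float) -> float:
    """P(beta) = sum_j (lam1 * |beta_j| + (lam2 / 2) * beta_j^2)."""
    return sum(lam1 * abs(b) + 0.5 * lam2 * b * b for b in beta)


def enet_dwd_objective(x: List[List[float]], y: List[int], beta0: float,
                       beta: Sequence[float], lam1: float, lam2: float) -> float:
    """Average DWD loss over the margins plus the elastic-net penalty."""
    total = 0.0
    for xi, yi in zip(x, y):
        margin = yi * (beta0 + sum(a * b for a, b in zip(xi, beta)))
        total += dwd_loss(margin)
    return total / len(y) + enet_penalty(beta, lam1, lam2)


def classify(xi: Sequence[float], beta0: float, beta: Sequence[float]) -> int:
    """Sign rule; a score of exactly 0 is assigned to +1 (assumed convention)."""
    score = beta0 + sum(a * b for a, b in zip(xi, beta))
    return 1 if score >= 0 else -1
```

Setting `lam2 = 0` in this sketch gives the $\ell_1$ DWD criterion (2.1) as a special case.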
A further refinement of the elastic-net penalty is the adaptive elastic-net penalty (Zou 
and Zhang, 2009), where we replace the $\ell_1$ (lasso) penalty with the adaptive $\ell_1$ (lasso) penalty 
(Zou, 2006). The adaptive lasso penalty produces estimators with the oracle properties. The 
adaptive elastic-net enjoys the benefits of both the elastic-net and the adaptive lasso. After fitting the 
elastic-net penalized DWD, we further consider the adaptive elastic-net penalized DWD: 


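$$
(\hat\beta_0(\mathrm{aenet}), \hat\beta(\mathrm{aenet})) = \arg\min_{\beta_0, \beta} \ \frac{1}{n}\sum_{i=1}^n V\left(y_i(\beta_0 + x_i^T\beta)\right) + \sum_{j=1}^p \left(\lambda_1\hat{w}_j|\beta_j| + \frac{\lambda_2}{2}\beta_j^2\right), \tag{2.3}
$$

and the adaptive weights are computed by 

$$
\hat{w}_j = \left(|\hat\beta_j(\mathrm{enet})| + 1/n\right)^{-1},
$$

where $\hat\beta_j(\mathrm{enet})$ is the solution of $\beta_j$ in (2.2). The adaptive elastic-net penalized DWD 
classification rule is $\mathrm{Sign}(\hat\beta_0(\mathrm{aenet}) + x^T\hat\beta(\mathrm{aenet}))$. 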
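The adaptive weights are a simple transformation of a first-stage elastic-net fit. The sketch below assumes the elastic-net coefficients are already available as a list; the function names are illustrative and not part of any package.

```python
from typing import List, Sequence


def adaptive_weights(beta_enet: Sequence[float], n: int) -> List[float]:
    """w_j = 1 / (|beta_j(enet)| + 1/n); the 1/n term keeps weights finite."""
    return [1.0 / (abs(b) + 1.0 / n) for b in beta_enet]


def adaptive_enet_penalty(beta: Sequence[float], weights: Sequence[float],
                          lam1: float, lam2: float) -> float:
    """sum_j (lam1 * w_j * |beta_j| + (lam2 / 2) * beta_j^2), as in (2.3)."""
    return sum(lam1 * w * abs(b) + 0.5 * lam2 * b * b
               for b, w in zip(beta, weights))


# A coefficient that was zero in the first stage gets the largest weight (n),
# so it is penalized most heavily in the second stage.
print(adaptive_weights([0.0, 0.48], n=50))
```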


3 Computation 

The $\ell_2$ DWD was solved by second-order cone programming; nevertheless, it is 
not trivial to generalize that algorithm to the $\ell_1$ DWD, and it is even more difficult to handle the 
elastic-net and the adaptive elastic-net penalties. In this section, we propose a completely 
different algorithm. We compute the solution paths of the sparse DWD by using the generalized 
coordinate descent (GCD) algorithm proposed by Yang and Zou (2013). We introduce the 
GCD algorithm in section 3.1, the implementation in section 3.2, and the strict descent 
property in section 3.3. The same algorithm solves the $\ell_1$, the elastic-net, and the adaptive 
elastic-net penalized DWDs; for ease of presentation, we focus on the elastic-net in the 
discussion. 

3.1 Derivation of the algorithm 

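Without loss of generality, we assume that the variables $x_j$ are standardized: 
$\frac{1}{n}\sum_{i=1}^n x_{ij} = 0$ and $\frac{1}{n}\sum_{i=1}^n x_{ij}^2 = 1$, for $j = 1, \ldots, p$. We fix $\lambda_1$ and $\lambda_2$ and let 
$u_i = y_i(\tilde\beta_0 + x_i^T\tilde\beta)$, where $(\tilde\beta_0, \tilde\beta)$ denotes the current estimate. We focus on 
the $\beta_j$'s first. For each $\beta_j$, we define the coordinate-wise update function: 

$$
F(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{1}{n}\sum_{i=1}^n V\left(u_i + y_i x_{ij}(\beta_j - \tilde\beta_j)\right) + P_{\lambda_1, \lambda_2}(\beta_j). \tag{3.1}
$$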
Then the standard coordinate descent algorithm suggests cyclically updating 

$$
\tilde\beta_j = \arg\min_{\beta_j} F(\beta_j \mid \tilde\beta_0, \tilde\beta) \tag{3.2}
$$

for each $j = 1, \ldots, p$. However, (3.2) does not have a closed-form solution. The GCD 
algorithm solves this issue by adopting the MM principle (Hunter and Lange, 2004). We 
approximate the $F$ function by a quadratic function 


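$$
Q(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{\sum_{i=1}^n V(u_i)}{n} + \frac{\sum_{i=1}^n V'(u_i) y_i x_{ij}}{n}(\beta_j - \tilde\beta_j) + 2(\beta_j - \tilde\beta_j)^2 + P_{\lambda_1, \lambda_2}(\beta_j). \tag{3.3}
$$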


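Then we update $\tilde\beta_j$ by $\tilde\beta_j^{new}$, the closed-form minimizer of (3.3): 

$$
\tilde\beta_j^{new} = \frac{S\left(4\tilde\beta_j - \frac{1}{n}\sum_{i=1}^n V'(u_i) y_i x_{ij}, \ \lambda_1\right)}{4 + \lambda_2}, \tag{3.4}
$$

where $S(z, r) = \mathrm{sign}(z)(|z| - r)_+$ is the soft-thresholding operator (Donoho and Johnstone, 
1994) and $\omega_+ = \max(\omega, 0)$ is the positive part of $\omega$. 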
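The update (3.4) amounts to one gradient evaluation followed by soft-thresholding. A minimal sketch of a single coordinate update is given below; it assumes the columns of `x` are standardized as stated above, and the names `soft_threshold` and `update_coordinate` are illustrative rather than taken from sdwd.

```python
from typing import List, Sequence


def dwd_loss_derivative(u: float) -> float:
    """V'(u) for the DWD loss."""
    return -1.0 if u <= 0.5 else -1.0 / (4.0 * u * u)


def soft_threshold(z: float, r: float) -> float:
    """S(z, r) = sign(z) * (|z| - r)_+."""
    if z > r:
        return z - r
    if z < -r:
        return z + r
    return 0.0


def update_coordinate(j: int, beta: Sequence[float], u: Sequence[float],
                      x: List[List[float]], y: Sequence[int],
                      lam1: float, lam2: float) -> float:
    """Closed-form minimizer (3.4) of the quadratic surrogate for beta_j.

    u holds the current margins u_i = y_i * (beta0 + x_i^T beta).
    """
    n = len(y)
    grad = sum(dwd_loss_derivative(u[i]) * y[i] * x[i][j] for i in range(n)) / n
    return soft_threshold(4.0 * beta[j] - grad, lam1) / (4.0 + lam2)
```

The constant 4 comes from the curvature bound on $V$, so no step size needs to be tuned.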

With the intercept similarly updated, Algorithm 1 summarizes the details of the GCD 
algorithm. 


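Algorithm 1 The GCD algorithm for the sparse DWD 

1. Initialize $(\tilde\beta_0, \tilde\beta)$. 

2. Cyclic coordinate descent, for $j = 1, 2, \ldots, p$: 

(a) Compute $u_i = y_i(\tilde\beta_0 + x_i^T\tilde\beta)$. 

(b) Compute $\tilde\beta_j^{new} = \frac{1}{4 + \lambda_2} \cdot S\left(4\tilde\beta_j - \frac{1}{n}\sum_{i=1}^n V'(u_i) y_i x_{ij}, \ \lambda_1\right)$. 

(c) Set $\tilde\beta_j = \tilde\beta_j^{new}$. 

3. Update the intercept term: 

(a) Compute $u_i = y_i(\tilde\beta_0 + x_i^T\tilde\beta)$. 

(b) Compute $\tilde\beta_0^{new} = \tilde\beta_0 - \frac{1}{4n}\sum_{i=1}^n V'(u_i) y_i$. 

(c) Set $\tilde\beta_0 = \tilde\beta_0^{new}$. 

4. Repeat steps 2-3 until convergence of $(\tilde\beta_0, \tilde\beta)$. 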
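The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative implementation for a single $(\lambda_1, \lambda_2)$ pair under the standardization assumption, not the code of the sdwd package; the function name, the default tolerance, and the cycle cap are assumptions of this sketch.

```python
from typing import List, Optional, Tuple


def dwd_loss_derivative(u: float) -> float:
    """V'(u) for the DWD loss: -1 if u <= 1/2, else -1/(4u^2)."""
    return -1.0 if u <= 0.5 else -1.0 / (4.0 * u * u)


def soft_threshold(z: float, r: float) -> float:
    """S(z, r) = sign(z) * (|z| - r)_+."""
    if z > r:
        return z - r
    if z < -r:
        return z + r
    return 0.0


def gcd_sparse_dwd(x: List[List[float]], y: List[int], lam1: float, lam2: float,
                   beta0: float = 0.0, beta: Optional[List[float]] = None,
                   tol: float = 1e-8, max_cycles: int = 10000
                   ) -> Tuple[float, List[float]]:
    """Elastic-net DWD fit; the columns of x are assumed standardized."""
    n, p = len(x), len(x[0])
    beta = [0.0] * p if beta is None else list(beta)
    # Margins u_i = y_i (beta0 + x_i^T beta), kept current incrementally.
    u = [y[i] * (beta0 + sum(x[i][j] * beta[j] for j in range(p)))
         for i in range(n)]
    for _ in range(max_cycles):
        max_change = 0.0
        for j in range(p):  # step 2: cyclic coordinate updates
            grad = sum(dwd_loss_derivative(u[i]) * y[i] * x[i][j]
                       for i in range(n)) / n
            new = soft_threshold(4.0 * beta[j] - grad, lam1) / (4.0 + lam2)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    u[i] += y[i] * x[i][j] * delta
                beta[j] = new
            max_change = max(max_change, 4.0 * delta * delta)
        # Step 3: the intercept is unpenalized, so no thresholding is applied.
        grad0 = sum(dwd_loss_derivative(u[i]) * y[i] for i in range(n)) / n
        delta0 = -grad0 / 4.0
        for i in range(n):
            u[i] += y[i] * delta0
        beta0 += delta0
        max_change = max(max_change, 4.0 * delta0 * delta0)
        if max_change < tol:  # step 4: stop once every update is negligible
            break
    return beta0, beta
```

Passing a previous solution through `beta0` and `beta` gives a warm start when the function is called along a decreasing sequence of `lam1` values.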
3.2 Implementation 


We have implemented Algorithm 1 in an R package sdwd. We exploit the warm-start, the 
strong rule, and the active set trick to speed up the algorithm. In our implementation, 
$\lambda_2$ is pre-chosen and we compute the solution path as $\lambda_1$ varies. 

First, we adopt the warm-start to obtain a faster and more stable algorithm (Friedman 
et al., 2007). We compute the solutions at a grid of $K$ decreasing $\lambda_1$ values, starting at 
the smallest $\lambda_1$ value such that $\hat\beta = 0$. Denote these grid points by $\lambda_1^{[1]}, \ldots, \lambda_1^{[K]}$. With 
the warm-start trick, we can use the solution at $\lambda_1^{[k]}$ as the initial value (the warm-start) to 
compute the solution at $\lambda_1^{[k+1]}$. 

Specifically, to find $\lambda_1^{[1]}$, we fit a model with a sufficiently large $\lambda_1$ and thus $\hat\beta = 0$. Let $\hat\beta_0$ 
be the estimate of the intercept. By the KKT conditions, 
$\frac{1}{n}\max_j \left|\sum_{i=1}^n V'(y_i\hat\beta_0) y_i x_{ij}\right| \le \lambda_1$, so we can choose 

$$
\lambda_1^{[1]} = \frac{1}{n}\max_j \left|\sum_{i=1}^n V'(y_i\hat\beta_0) y_i x_{ij}\right|.
$$

Generally, we use $K = 100$, and $\lambda_1^{[100]} = \epsilon\lambda_1^{[1]}$, where $\epsilon = 10^{-4}$ when $n < p$ and $\epsilon = 10^{-2}$ 
otherwise. All the other grid points are placed uniformly on a log scale. 
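The grid construction can be sketched as follows. The code assumes the intercept estimate of the null model is supplied by the caller; `lambda_max` and `lambda_grid` are hypothetical helper names used only for this illustration.

```python
import math
from typing import List


def dwd_loss_derivative(u: float) -> float:
    """V'(u) for the DWD loss."""
    return -1.0 if u <= 0.5 else -1.0 / (4.0 * u * u)


def lambda_max(x: List[List[float]], y: List[int], beta0_hat: float) -> float:
    """Smallest lam1 with beta = 0, given the intercept of the null model."""
    n, p = len(x), len(x[0])
    d = [dwd_loss_derivative(y[i] * beta0_hat) for i in range(n)]
    return max(abs(sum(d[i] * y[i] * x[i][j] for i in range(n)))
               for j in range(p)) / n


def lambda_grid(lam_max: float, n: int, p: int, num: int = 100) -> List[float]:
    """Decreasing grid from lam_max to eps * lam_max, uniform on a log scale."""
    eps = 1e-4 if n < p else 1e-2
    log_ratio = math.log(eps) / (num - 1)
    return [lam_max * math.exp(k * log_ratio) for k in range(num)]
```

Because consecutive grid values share a constant ratio, neighboring solutions tend to be close, which is what makes the warm-start effective.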

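Second, we follow the strong rule (Tibshirani et al., 2010) to improve the computational 
speed. Suppose $\hat\beta_0^{[k]}$ and $\hat\beta^{[k]}$ are the solutions at $\lambda_1^{[k]}$. After we solve $\hat\beta_0^{[k]}$ and $\hat\beta^{[k]}$, the strong 
rule claims that any $j \in \{1, \ldots, p\}$ satisfying 

$$
\left|\frac{1}{n}\sum_{i=1}^n V'\left(y_i(\hat\beta_0^{[k]} + x_i^T\hat\beta^{[k]})\right) y_i x_{ij}\right| < 2\lambda_1^{[k+1]} - \lambda_1^{[k]} \tag{3.5}
$$

is likely to be inactive at $\lambda_1^{[k+1]}$, i.e., $\hat\beta_j^{[k+1]} = 0$. Let $\mathcal{D}$ be the collection of $j$ which satisfy 
(3.5), and let its complement be $\mathcal{D}^c = \{1, \ldots, p\} \setminus \mathcal{D}$. We call $\mathcal{D}^c$ the survival set. Assuming the strong rule 
guesses correctly, the variables contained in $\mathcal{D}$ are discarded, and we only apply Algorithm 1 
to repeat the coordinate descent in the survival set $\mathcal{D}^c$. After computing the solution $\tilde\beta_0$ 
and $\tilde\beta$, we need to check whether some variables are incorrectly discarded. We check this by 
the KKT condition, 

$$
\left|\frac{1}{n}\sum_{i=1}^n V'\left(y_i(\tilde\beta_0 + x_i^T\tilde\beta)\right) y_i x_{ij}\right| \le \lambda_1^{[k+1]}. \tag{3.6}
$$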


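If no $j \in \mathcal{D}$ violates (3.6), $\tilde\beta_0$ and $\tilde\beta$ are the solutions at $\lambda_1^{[k+1]}$. We denote them as $\hat\beta_0^{[k+1]}$ 
and $\hat\beta^{[k+1]}$. Otherwise, any incorrectly discarded variable should be added to the survival 
set $\mathcal{D}^c$. We update $\mathcal{D}$ by $\mathcal{D} = \mathcal{D} \setminus \mathcal{U}$, where 

$$
\mathcal{U} = \left\{ j : j \in \mathcal{D} \text{ and } \left|\frac{1}{n}\sum_{i=1}^n V'\left(y_i(\tilde\beta_0 + x_i^T\tilde\beta)\right) y_i x_{ij}\right| > \lambda_1^{[k+1]} \right\}.
$$

After each update of $\mathcal{D}$, some incorrectly discarded variables are added back to the survival 
set. 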
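The screening and the subsequent KKT check reduce to two comparisons on the absolute gradients. The sketch below takes those gradient magnitudes as precomputed inputs; the function names are illustrative assumptions.

```python
from typing import Sequence, Set


def strong_rule_discard(grad_abs_prev: Sequence[float],
                        lam_prev: float, lam_next: float) -> Set[int]:
    """Indices satisfying (3.5): |gradient_j| < 2 * lam_next - lam_prev.

    grad_abs_prev[j] is the absolute gradient for variable j evaluated at
    the solution for lam_prev.
    """
    threshold = 2.0 * lam_next - lam_prev
    return {j for j, g in enumerate(grad_abs_prev) if g < threshold}


def kkt_violations(grad_abs_new: Sequence[float], discarded: Set[int],
                   lam_next: float) -> Set[int]:
    """Discarded indices that violate (3.6) at the candidate solution."""
    return {j for j in discarded if grad_abs_new[j] > lam_next}


# Example: variables 1 and 2 are screened out, then variable 1 is restored.
discarded = strong_rule_discard([0.9, 0.3, 0.75], lam_prev=1.0, lam_next=0.9)
restored = kkt_violations([0.9, 0.95, 0.5], discarded, lam_next=0.9)
discarded -= restored
```

In a full solver, the coordinate descent would be rerun on the enlarged survival set whenever `restored` is non-empty.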

Third, the active set is also used to boost the algorithm speed. After we apply Algorithm 1 
on the survival set $\mathcal{D}^c$, we only apply the coordinate descent on a subset $\mathcal{S}$ of $\mathcal{D}^c$ till 
convergence, where $\mathcal{S} = \{j : j \in \mathcal{D}^c \text{ and } \tilde\beta_j \ne 0\}$. Then another cycle of coordinate descent 
is run on $\mathcal{D}^c$ to investigate whether the active set $\mathcal{S}$ changes. We finish the algorithm if $\mathcal{S}$ does 
not change; otherwise, we update the active set $\mathcal{S}$ and repeat the process. 

In Algorithm 1, the margin $u_i$ can be updated conveniently: if $\tilde\beta_j$ is updated by $\tilde\beta_j^{new}$, we 
update $u_i$ by $u_i + y_i x_{ij}(\tilde\beta_j^{new} - \tilde\beta_j)$. 

Last, the default convergence rule in sdwd is $4(\tilde\beta_j^{new} - \tilde\beta_j)^2 < 10^{-8}$ for all $j = 0, 1, \ldots, p$. 


3.3 The strict descent property of Algorithm 1 

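Yang and Zou (2013) showed that the GCD algorithm enjoys the descent property. In this section, 
we show that the GCD algorithm satisfies a stronger property, the strict descent property, when 
the GCD is used to solve the sparse DWD. We first state the following majorization 
result, whose proof is deferred to the appendix. 

Lemma 1. Let $F(\beta_j \mid \tilde\beta_0, \tilde\beta)$ be the coordinate-wise update function defined in (3.1), and let $Q(\beta_j \mid \tilde\beta_0, \tilde\beta)$ 
be the surrogate function defined in (3.3). We have (3.7) and (3.8): 

$$
F(\beta_j \mid \tilde\beta_0, \tilde\beta) = Q(\beta_j \mid \tilde\beta_0, \tilde\beta), \quad \text{if } \beta_j = \tilde\beta_j, \tag{3.7}
$$

$$
F(\beta_j \mid \tilde\beta_0, \tilde\beta) < Q(\beta_j \mid \tilde\beta_0, \tilde\beta), \quad \text{if } \beta_j \ne \tilde\beta_j. \tag{3.8}
$$

Given $\tilde\beta_j^{new} = \arg\min_{\beta_j} Q(\beta_j \mid \tilde\beta_0, \tilde\beta)$, and assuming $\tilde\beta_j^{new} \ne \tilde\beta_j$, (3.7) and (3.8) imply the 
strict descent property of the GCD algorithm: $F(\tilde\beta_j^{new} \mid \tilde\beta_0, \tilde\beta) < F(\tilde\beta_j \mid \tilde\beta_0, \tilde\beta)$. This is because 
$F(\tilde\beta_j^{new} \mid \tilde\beta_0, \tilde\beta) < Q(\tilde\beta_j^{new} \mid \tilde\beta_0, \tilde\beta) \le Q(\tilde\beta_j \mid \tilde\beta_0, \tilde\beta) = F(\tilde\beta_j \mid \tilde\beta_0, \tilde\beta)$. Note that the original GCD 
paper only showed $F(\tilde\beta_j^{new} \mid \tilde\beta_0, \tilde\beta) \le F(\tilde\beta_j \mid \tilde\beta_0, \tilde\beta)$. 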
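The strict majorization behind Lemma 1 rests on the scalar inequality $V(a) < V(b) + V'(b)(a - b) + 2(a - b)^2$ for $a \ne b$. As a quick numerical sanity check (not a proof), the sketch below evaluates the gap between the two sides on a small grid; the grid and the function names are arbitrary choices made for illustration.

```python
def dwd_loss(u: float) -> float:
    """DWD loss V(u)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def dwd_loss_derivative(u: float) -> float:
    """V'(u) for the DWD loss."""
    return -1.0 if u <= 0.5 else -1.0 / (4.0 * u * u)


def majorization_gap(a: float, b: float) -> float:
    """Quadratic upper bound at a, expanded around b, minus V(a)."""
    bound = dwd_loss(b) + dwd_loss_derivative(b) * (a - b) + 2.0 * (a - b) ** 2
    return bound - dwd_loss(a)


# Grid over [-2, 3] in steps of 0.25; the gap should be positive off-diagonal.
grid = [k / 4.0 for k in range(-8, 13)]
gaps = [majorization_gap(a, b) for a in grid for b in grid if a != b]
print(min(gaps) > 0.0)
```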

The arguments above prove that the objective function F strictly decreases after updating 
all variables in a cycle, unless the solution does not change after each update. If this is the 
case, the algorithm stops. We show that the algorithm must stop at the right answer. 
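Assuming $\tilde\beta_j = \tilde\beta_j^{new}$ for all $j$, (3.4) implies: 

$$
\tilde\beta_j = \frac{S\left(4\tilde\beta_j - \frac{1}{n}\sum_{i=1}^n V'(u_i) y_i x_{ij}, \ \lambda_1\right)}{4 + \lambda_2}.
$$

Straightforward algebra shows that for all $j$, 

$$
\frac{1}{n}\sum_{i=1}^n V'(u_i) y_i x_{ij} + \lambda_1 \mathrm{sign}(\tilde\beta_j) + \lambda_2\tilde\beta_j = 0, \quad \text{if } \tilde\beta_j \ne 0;
$$

$$
\left|\frac{1}{n}\sum_{i=1}^n V'(u_i) y_i x_{ij}\right| \le \lambda_1, \quad \text{if } \tilde\beta_j = 0,
$$

which are exactly the KKT conditions of the original objective function (2.2). In conclusion, 
if the objective function does not change after a cycle, the algorithm necessarily converges 
to the correct solution satisfying the KKT conditions. 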
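The KKT conditions above are easy to verify numerically for a candidate solution. The following sketch does so within a tolerance; it also checks stationarity of the unpenalized intercept, which is an addition of this sketch (the text lists only the conditions for the $\beta_j$'s). The function name and tolerance are assumptions.

```python
from typing import List, Sequence


def dwd_loss_derivative(u: float) -> float:
    """V'(u) for the DWD loss."""
    return -1.0 if u <= 0.5 else -1.0 / (4.0 * u * u)


def satisfies_kkt(x: List[List[float]], y: Sequence[int], beta0: float,
                  beta: Sequence[float], lam1: float, lam2: float,
                  tol: float = 1e-6) -> bool:
    """Check the elastic-net DWD optimality conditions up to tol."""
    n, p = len(x), len(x[0])
    u = [y[i] * (beta0 + sum(x[i][j] * beta[j] for j in range(p)))
         for i in range(n)]
    d = [dwd_loss_derivative(ui) for ui in u]
    # Intercept stationarity (assumed extra check for the unpenalized term).
    if abs(sum(d[i] * y[i] for i in range(n)) / n) > tol:
        return False
    for j in range(p):
        grad = sum(d[i] * y[i] * x[i][j] for i in range(n)) / n
        if beta[j] != 0.0:
            sign = 1.0 if beta[j] > 0 else -1.0
            if abs(grad + lam1 * sign + lam2 * beta[j]) > tol:
                return False
        elif abs(grad) > lam1 + tol:
            return False
    return True
```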

4 Simulation 

The simulation in this section aims to support the following three points: (1) the sparse 
DWD has highly competitive prediction accuracy with the sparse SVM and the sparse 
logistic regression; (2) the adaptive elastic-net penalized DWD performs the best in variable 
selection; (3) for the prediction accuracy, no single method among the $\ell_1$, the elastic-net, 
and the adaptive elastic-net penalized DWDs dominates the others in all situations. 

In this section, the response variables of all the data are binary. The dimension $p$ of the 
variables $x_i$ is always 3000. Within each example, our simulated data consist of a training 
set, an independent validation set, and an independent test set. The training set contains 
50 observations: 25 of them are from the positive class and the other 25 from the negative 
class. Models are fitted on the training data only, and we use an independent validation set 
of 50 observations to select the tuning parameters: $\lambda_2$ is selected from $10^{-4}$, $10^{-3}$, $10^{-2}$, 0.1, 
1, 5, and 10; $\lambda_1$ is searched along the solution paths. We compared the prediction accuracy 
(in percentage) on another independent test data set of 20,000 observations. 

We followed Marron et al. (2007) to generate the first two examples. In example 1, the 
positive class is a random sample from $N_p(\mu_+, I_p)$, where $I_p$ is the $p$ by $p$ identity matrix and 
$\mu_+$ has all zeros except for 2.2 at the first dimension; the negative class is from $N_p(\mu_-, I_p)$ 
with $\mu_- = -\mu_+$. In example 2, 80% of the data are generated from the same distributions 
as example 1; for the other 20% of the data, the positive class is drawn from $N_p(\mu_+, I_p)$ 
and the negative class from $N_p(-\mu_+, I_p)$, where $\mu_+ = (100, 500, 0, \ldots, 0)$ for this portion. We obtained the other 
three examples following Wang et al. (2006). In example 3, the positive class has a normal 
distribution with mean $\mu_+$ and covariance $\Sigma = I_{p \times p}$, where $\mu_+$ has 0.7 in the first five 
covariates and 0 in others; the negative class has the same distribution except for a different 
mean $\mu_- = -\mu_+$. In examples 4 and 5, we consider the cases where the relevant variables 
are correlated. The two classes have the same distributions as in example 3 except for the covariance, 

$$
\Sigma = \begin{pmatrix} \Sigma^*_{5 \times 5} & 0_{5 \times (p-5)} \\ 0_{(p-5) \times 5} & I_{(p-5) \times (p-5)} \end{pmatrix}.
$$

In example 4, the diagonal elements of $\Sigma^*$ are 1 and the off-diagonal elements are all equal 
to 0.7. In example 5, the $(i, j)$th element of $\Sigma^*$ equals $0.7^{|i-j|}$. 
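For readers who wish to reproduce the simplest setting, the sketch below draws data as in example 1 and computes the Bayes error for the identity-covariance designs (examples 1 and 3). It assumes equal class priors, under which the Bayes error of $N_p(\mu_+, I_p)$ versus $N_p(-\mu_+, I_p)$ is $\Phi(-\|\mu_+\|_2)$; the function names and the seed are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple


def simulate_example1(n_per_class: int, p: int, shift: float = 2.2,
                      seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Balanced sample: class +1 ~ N(mu, I), class -1 ~ N(-mu, I),
    where mu is zero except for `shift` in the first coordinate."""
    rng = random.Random(seed)
    x, y = [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(p)]
            row[0] += label * shift
            x.append(row)
            y.append(label)
    return x, y


def bayes_error_identity_cov(mu_plus: Sequence[float]) -> float:
    """Phi(-||mu_+||) for means +/- mu_+, identity covariance, equal priors."""
    norm = math.sqrt(sum(m * m for m in mu_plus))
    return 0.5 * math.erfc(norm / math.sqrt(2.0))


x_train, y_train = simulate_example1(n_per_class=25, p=3000)
print(100 * bayes_error_identity_cov([2.2]))        # example 1, in percent
print(100 * bayes_error_identity_cov([0.7] * 5))    # example 3, in percent
```

Under the equal-prior assumption these two values agree with the Bayes errors reported for examples 1 and 3 in Table 1.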

We compared the sparse DWD with the sparse SVM and the sparse logistic regression. 
Both the DWD and the logistic regression use the $\ell_1$, the elastic-net, and the adaptive 
elastic-net penalties. We used the R packages sdwd and gcdnet (Yang and Zou, 2013) to compute the 
sparse DWDs and the sparse logistic regressions, respectively. The $\ell_1$ and the elastic-net 
SVMs were solved by using the code from Wang et al. (2006), which does not handle the 
adaptive elastic-net penalty. Table 1 presents the prediction accuracy results. In the first two 
examples, the $\ell_1$ DWD and the $\ell_1$ logistic regression perform the best. We attribute this good 
performance to there being only one nonzero variable in the data, despite the 20% of outliers in example 
2. In examples 3, 4, and 5, we increase the number of nonzero variables to five. For all models, 
the elastic-net and the adaptive elastic-net penalties have similar performance, and both of 
them dominate the $\ell_1$ penalties. The elastic-net DWD produces the least prediction error in 
examples 4 and 5. Table 2 compares the variable selection. In all cases, the adaptive 
elastic-net penalties identify all relevant variables with relatively few mistakes. The $\ell_1$ penalties 
share similar performance in the first two examples. 

5 Real Data Examples 

In this section we analyze four benchmark datasets. The Arcene data were obtained from Frank 
and Asuncion (2010), the breast cancer data from Graham et al. (2010), the LSVT data from 
Tsanas et al. (2014), and the prostate cancer data from Singh et al. (2002). We randomly 
split each dataset with a ratio of 1:1 into a training set and a test set. On the training set, we 
fit the sparse DWD with the elastic-net and the adaptive elastic-net penalties. 
With the same tuning parameter candidates as in the simulation, we used five-fold 
cross-validation to find the best pair of $(\lambda_1, \lambda_2)$ incurring the least mis-classification rate. Then we 
Table 1. Comparisons of mis-classification percentage on 300 training data, 300 validation data, 
and 20,000 test data, based on 200 replicates. The numbers in parentheses are the standard errors. 
For each example, the methods with the best performance are marked by black boxes. 

              DWD                        SVM               logistic
              ℓ1       enet     aenet    ℓ1       enet     ℓ1       enet     aenet
Example 1     1.42     1.47     1.44     1.46     1.50     1.42     1.46     1.44
Bayes: 1.39   (0.01)   (0.02)   (0.01)   (0.01)   (0.02)   (0.01)   (0.02)   (0.02)
Example 2     1.14     1.15     1.13     1.16     1.16     1.11     1.14     1.15
Bayes: 1.11   (0.01)   (0.01)   (0.01)   (0.01)   (0.01)   (0.01)   (0.01)   (0.02)
Example 3     6.41     6.25     6.21     6.45     6.15     6.40     6.21     6.22
Bayes: 5.88   (0.03)   (0.03)   (0.03)   (0.04)   (0.03)   (0.03)   (0.03)   (0.03)
Example 4     22.05    21.48    21.54    22.03    21.56    22.00    21.54    21.64
Bayes: 21.10  (0.07)   (0.07)   (0.05)   (0.06)   (0.05)   (0.06)   (0.06)   (0.06)
Example 5     18.91    18.74    18.75    18.84    18.78    18.81    18.80    18.77
Bayes: 18.03  (0.07)   (0.05)   (0.05)   (0.06)   (0.05)   (0.06)   (0.05)   (0.05)


investigated the prediction accuracy of the selected model on the test set. For comparison, 
we considered the sparse SVM and the sparse logistic regression. Every method was trained 
and tuned in the same way as the sparse DWD. All numerical experiments were carried out 
on an Intel Core i7-3770 (3.40 GHz) processor. 

In Table 3, we report the average mis-classification percentage on the test set from 
200 independent splits. We observe that the classifiers achieving the least error in these four 
datasets are the adaptive elastic-net logistic regression, the elastic-net SVM, and the elastic-net 
and the adaptive elastic-net DWDs. We also find that the differences are not large. For 
the sparse DWD, we get the same message as Marron et al. (2007) concluded for the standard 
DWD: “it very often is competitive with the best of the others and sometimes is better.” 
We also notice that the computation of the sparse DWD is the fastest in almost all cases. 
The timing of the SVM is much longer than that of the other methods. A possible explanation is that 
the SVM uses the non-differentiable hinge loss function, which makes the GCD algorithm 
unsuitable for solving the sparse SVM. So far, the best algorithm for the sparse SVM is a 
LARS type algorithm (Wang et al., 2006), which is very different from the GCD algorithm for 
the sparse DWD and logistic regression. It has been observed that coordinate descent may 
be faster than the LARS algorithm for solving the lasso penalized least squares (Friedman 
et al., 2007). 
Table 2. Comparisons of the variable selection. C is the number of selected nonzero variables, and 
IC is the number of zero variables incorrectly selected into the model. The results are the medians 
over 200 replicates. 

             DWD                              SVM                   logistic
             ℓ1         enet       aenet      ℓ1         enet       ℓ1         enet       aenet
             C   IC     C   IC     C   IC     C   IC     C   IC     C   IC     C   IC     C   IC
Example 1    1   0      1   2      1   0      1   0      1   4      1   0      1   4.5    1   0
Example 2    1   0      1   0      1   0      1   0      1   1      1   0      1   0      1   0
Example 3    5   0      5   5      5   0      5   0      5   2.5    5   1      5   7      5   0
Example 4    4   1      5   8.5    5   1.5    4   0      5   7      4   1      5   14     5   2
Example 5    4   1      5   3.5    5   0      4   0      5   2      4   1      5   6.5    5   0


6 Discussion 

In this article, we have proposed the sparse DWD for high-dimensional classification and 
developed an efficient algorithm to compute its solution path. We have shown that the 
sparse DWD has competitive prediction performance with the sparse SVM and the sparse 
logistic regression and is often faster to compute with the help of our algorithm. Thus, the 
sparse DWD is a valuable addition to the toolbox for high-dimensional classification. 

The generalized DWD defined in Hall et al. (2005) minimizes the $q$th power of the inverse 
margins. When $q = 1$, it reduces to the usual DWD. For computational considerations, 
Marron et al. (2007) chose to fix $q = 1$, because it leads to a second-order cone programming 
problem. We have found that our algorithm can be readily used to solve the sparse 
generalized DWD with any positive $q$. In our numerical study we tried the generalized DWD with 
$q = 0.5, 1, 2, 5, 100$ and also tried to use cross-validation to select a data-driven $q$ value. Our 
numerical results indicated that using different $q$ values does not lead to significant differences 
in performance. We opt to leave those results to the technical report version of this paper. 
Table 3. The mean mis-classification percentage and timings (in seconds) for four benchmark 
datasets. All the timings include the five-fold cross-validation. The timings of adaptive 
elastic-net methods include computing the weights. The numbers in parentheses are the standard errors. 
For each dataset, the methods with the best prediction accuracy are marked by square brackets. 

                Arcene                Breast                LSVT                  Prostate
                n = 100, p = 10000    n = 42, p = 22283     n = 126, p = 309      n = 102, p = 6033
                error     time        error     time        error     time        error     time
enet DWD        34.43     123.41      26.50     58.40       16.01     8.28        [10.22]   28.18
                (0.56)    (5.16)      (1.00)    (1.90)      (0.34)    (0.23)      (0.30)    (0.95)
aenet DWD       34.60     200.19      26.86     116.12      [15.92]   13.72       10.26     39.25
                (0.57)    (9.24)      (1.00)    (3.78)      (0.34)    (0.29)      (0.26)    (1.24)
enet logistic   34.16     211.18      24.67     145.35      16.96     10.73       10.65     102.19
                (0.58)    (3.40)      (1.00)    (0.74)      (0.37)    (0.18)      (0.29)    (1.56)
aenet logistic  [34.15]   393.03      25.12     290.31      16.93     17.02       10.75     189.44
                (0.57)    (6.52)      (0.87)    (1.47)      (0.37)    (0.29)      (0.29)    (2.84)
enet SVM        35.10     7410.09     [23.95]   567.43      16.27     63.10       10.56     2508.94
                (0.67)    (1465.68)   (1.00)    (15.19)     (0.37)    (0.77)      (0.36)    (0.77)

Appendix 

Proof of Lemma 1. (3.7) is trivial. To prove (3.8), it suffices to show that for any $a \ne b \in \mathbb{R}$, 

$$
V(a) < V(b) + V'(b)(a - b) + 2(a - b)^2. \tag{6.1}
$$

First, it is not hard to check that the first-order derivative $V'(t)$ is Lipschitz continuous, i.e., 
for any $a \ne b$, 

$$
|V'(a) - V'(b)| < 4|a - b|. \tag{6.2}
$$

Let $g(a) = 2a^2 - V(a)$; then (6.2) shows that $g'(a) = 4a - V'(a)$ is strictly increasing. Therefore 
$g(a)$ is a strictly convex function, and its first-order condition leads to (6.1) directly. 
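The Lipschitz bound (6.2) can be checked directly from the piecewise form of $V$. The following worked computation is a sketch of that step, which the proof leaves to the reader.

```latex
% Derivatives of the DWD loss V(u) = 1 - u (u <= 1/2), 1/(4u) (u > 1/2).
\[
V'(u) = \begin{cases} -1, & u \le 1/2, \\[2pt] -\dfrac{1}{4u^2}, & u > 1/2, \end{cases}
\qquad
V''(u) = \begin{cases} 0, & u < 1/2, \\[2pt] \dfrac{1}{2u^3}, & u > 1/2. \end{cases}
\]
% V' is continuous at u = 1/2, and 1/(2u^3) is decreasing with limit 4 as
% u decreases to 1/2, so 0 <= V''(u) < 4 for every u != 1/2.
\[
|V'(b) - V'(a)| = \int_a^b V''(u)\,du < 4\,(b - a), \qquad a < b .
\]
```

The constant 4 in (6.2) is therefore the supremum of $V''$, approached as $u$ decreases to $1/2$ but never attained, which is why the inequality is strict.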